Created by J.K. Rowling, the magical world of Harry Potter has captured the attention of readers all over the world with its vibrant characters, tight plots, and rich narrative fabric. The saga, which spans seven books and eight film adaptations, chronicles the adventures of a young wizard named Harry Potter as he battles the powerful dark forces of the wizarding world and navigates the turbulent worlds of magic, friendship, and destiny.
Beginning with “Harry Potter and the Philosopher’s Stone” and ending with “Harry Potter and the Deathly Hallows,” the series explores themes of courage, love, and the never-ending conflict between good and evil. Readers and viewers are introduced to many memorable characters in addition to Harry, such as the mysterious Albus Dumbledore, the devoted Ron Weasley, and the unwavering Hermione Granger.
On this text mining journey I hope to uncover hidden themes, patterns, and insights within the text as I delve deeper into the literary and cinematic realms of the Harry Potter saga. Through natural language processing, sentiment analysis, and topic modeling, I strive to reveal the underlying structures and subtleties that add to the enduring magic of Rowling's masterpiece.
Lastly, my initial hypothesis is that I will observe a decline in positive sentiment as the series goes on, as I believe it becomes more dramatic.
The first thing to do to start this analysis is to load the libraries and import the data, which lives in a package named “harrypotter” containing the full text of each of the seven J.K. Rowling books:
#install/update the library that contains the harry potter books data
if (packageVersion("devtools") < 1.6) {
install.packages("devtools")
}
devtools::install_github("bradleyboehmke/harrypotter")
library(stringr) # String manipulation functions
library(gridExtra) # Arrange multiple grid-based plots on one page
library(harrypotter) # Provides access to text data related to Harry Potter series
library(tidyverse) # Collection of packages for data manipulation and visualization (including dplyr, ggplot2, tidyr, etc.)
library(tidytext) # Text mining and analysis using tidy principles
library(tibble) # Provides data frames with more modern features
library(tm) # Text mining framework for R (document-term matrices)
library(ggplot2) # Data visualization package
library(scales) # Provides tools for scaling plots and axes
library(textdata) # Access to text datasets
library(RColorBrewer) # Color palettes for creating attractive graphics
library(wordcloud) # Create word clouds from text data
library(reshape2) # Reshape and aggregate data
library(forcats) # Tools for working with factors
library(igraph) # Network analysis and visualization (n-grams)
library(quanteda) # Quantitative analysis of textual data (corpus)
library(topicmodels) # To do topic modelling
Each text/book is a character vector in which each element represents a single chapter.
For example:
str(prisoner_of_azkaban)
## chr [1:22] " OWL POST Harry Potter was a highly unusual boy in many ways. For one thing, he hated the summer holidays"| __truncated__ ...
This book has 22 chapters.
Now we need to convert the Harry Potter novels into a tibble that has one row per word per chapter per book.
# Creating vectors to store book titles and their corresponding texts:
book_titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban", "Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince", "Deathly Hallows")
books <- list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban, goblet_of_fire, order_of_the_phoenix, half_blood_prince, deathly_hallows)
# Creating an empty tibble:
harry_potter_books <- tibble()
# Looping through the book titles
for(i in seq_along(book_titles)) {
#saving the cleaned text in the "clean" dataset
# Each chapter is represented as a different element inside the vector, so the way to access it is book[chapter]
clean <- tibble(chapter = seq_along(books[[i]]),
text = books[[i]]) |>
#tokenize the text into words.
unnest_tokens(word, text) |>
#assign book titles
mutate(book = book_titles[i]) |>
#first column is the book
select(book, everything())
#stack the books together in rows
harry_potter_books <- rbind(harry_potter_books, clean)
}
# Set factor levels for the books in reverse publication order
harry_potter_books$book <- factor(harry_potter_books$book, levels = rev(book_titles))
#eliminate the unnecessary dataset
rm(clean)
#This is our final dataset
harry_potter_books
## # A tibble: 1,089,386 × 3
## book chapter word
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 the
## 2 Philosopher's Stone 1 boy
## 3 Philosopher's Stone 1 who
## 4 Philosopher's Stone 1 lived
## 5 Philosopher's Stone 1 mr
## 6 Philosopher's Stone 1 and
## 7 Philosopher's Stone 1 mrs
## 8 Philosopher's Stone 1 dursley
## 9 Philosopher's Stone 1 of
## 10 Philosopher's Stone 1 number
## # ℹ 1,089,376 more rows
Besides the commonly used stopwords, I might need to filter some more that are specific to the HP books. Nonetheless, so far all the stopwords I have identified are already present in the stop_words tibble from tidytext. More stopwords, or otherwise uninformative words worth removing, may stand out later when other types of analyses are performed.
stop_words #tidytext stop words
## # A tibble: 1,149 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ℹ 1,139 more rows
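As a quick sanity check (a small sketch; the candidate words here are just examples I picked), we can test whether a term is already covered by the default list with %in%:

```r
library(tidytext) # provides the stop_words tibble

# candidate stopwords I might otherwise add by hand (example words)
candidates <- c("the", "said", "wand")

# TRUE means the word is already in stop_words and needs no custom entry
setNames(candidates %in% stop_words$word, candidates)
```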
Now we filter our novels:
#anti_join keeps in data1 the words not present in the intersection between data1 and data2
harry_potter_books <- harry_potter_books |> #data1
anti_join(stop_words, join_by(word))#data2
#we filter and keep all of the words that are not in both tables (harry_potter_books & stop_words)
harry_potter_books
## # A tibble: 409,338 × 3
## book chapter word
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 boy
## 2 Philosopher's Stone 1 lived
## 3 Philosopher's Stone 1 dursley
## 4 Philosopher's Stone 1 privet
## 5 Philosopher's Stone 1 drive
## 6 Philosopher's Stone 1 proud
## 7 Philosopher's Stone 1 perfectly
## 8 Philosopher's Stone 1 normal
## 9 Philosopher's Stone 1 people
## 10 Philosopher's Stone 1 expect
## # ℹ 409,328 more rows
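To make the anti_join semantics concrete, here is a minimal toy example with invented words (not the saga data): only the rows of the first table whose word has no match in the second survive.

```r
library(dplyr)

tokens <- tibble(word = c("the", "wand", "and", "hogwarts"))
stops  <- tibble(word = c("the", "and"))

# removes "the" and "and"; keeps "wand" and "hogwarts"
anti_join(tokens, stops, by = "word")
```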
harry_potter_books |>
count(word, sort = TRUE)
## # A tibble: 23,795 × 2
## word n
## <chr> <int>
## 1 harry 16557
## 2 ron 5750
## 3 hermione 4912
## 4 dumbledore 2873
## 5 looked 2344
## 6 professor 2006
## 7 hagrid 1732
## 8 time 1713
## 9 wand 1639
## 10 eyes 1604
## # ℹ 23,785 more rows
The most commonly occurring words in the Harry Potter book series are predominantly the names of the main characters, terms associated with the school where much of the saga unfolds, and magical terminology like “wand.”
Let’s make a simple plot with words and frequencies:
harry_potter_books |>
count(word, sort = TRUE) |>
#only words mentioned over 800 times in the novels
filter(n > 800) |>
#we reorder words by number of mentions
mutate(word = reorder(word, n)) |>
#we create the plot with the word (x) and the number of mentions (y)
ggplot(aes(n, word)) +
geom_col() +
labs(y = NULL)
As mentioned, we can see that the most frequent words are the characters' names or terms related to the magical context of the saga, such as Hogwarts, wand or dark.
frequency <- harry_potter_books |>
#regex to keep words and not _words_ (tokens wrapped in underscores for italics)
mutate(word = str_extract(word, "[a-z']+")) %>% #keep only the lowercase letters and apostrophes of each token
#we count number of mentions of a word for an author
count(book, word) %>%
#we calculate proportion over the total sum of words
group_by(book) %>%
mutate(proportion = n / sum(n)) %>%
select(-n) %>%
#we reshape the dataframe
#pivot wider means: more columns, less rows
pivot_wider(names_from = book, values_from = proportion) %>%
#pivot longer means: more rows, less columns
pivot_longer(`Chamber of Secrets`:`Deathly Hallows`,
names_to = "book", values_to = "proportion") |>
arrange(desc(proportion))
frequency
## # A tibble: 136,086 × 4
## word `Philosopher's Stone` book proportion
## <chr> <dbl> <chr> <dbl>
## 1 harry 0.0424 Chamber of Secrets 0.0447
## 2 harry 0.0424 Prisoner of Azkaban 0.0443
## 3 harry 0.0424 Half-Blood Prince 0.0411
## 4 harry 0.0424 Goblet of Fire 0.0404
## 5 harry 0.0424 Order of the Phoenix 0.0385
## 6 harry 0.0424 Deathly Hallows 0.0381
## 7 ron 0.0143 Chamber of Secrets 0.0194
## 8 ron 0.0143 Prisoner of Azkaban 0.0168
## 9 hermione 0.00899 Deathly Hallows 0.0148
## 10 hermione 0.00899 Prisoner of Azkaban 0.0146
## # ℹ 136,076 more rows
Perhaps it's a bit clearer plotted. My preference would be to observe word frequencies across the saga in relation to the first book, as it is the one that sets the tone and context for the rest of the books:
# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = proportion, y = `Philosopher's Stone`,
color = abs(`Philosopher's Stone` - proportion))) +
geom_abline(color = "gray40", lty = 2) +
#you can use geom_jitter to adjust the points location and gain visibility
geom_jitter(alpha = 0.1, size = 0.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 0.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001),
low = "darkslategray4", high = "gray75") +
facet_wrap(~book, ncol = 2) +
theme(legend.position="none") +
labs(y = "Philosopher's Stone", x = NULL)
The words far from the line are precisely the ones that give us some insight into what each book is about, since they appear much more often in one book than in the reference (Philosopher's Stone).
For example, in “The Half-Blood Prince” they play a Quidditch match on the field, and throughout the book many “do not panic” messages are sent by the Minister of Magic. Also, in that book a significant portion of the narrative focuses on Harry's efforts to obtain Professor Slughorn's authentic memory, which reveals that Slughorn disclosed crucial information to Voldemort regarding Horcruxes.
In general we can see that the second book (Chamber of Secrets) is somehow related to a chamber, secrets, spiders, something pure, and Ginny; the third (Prisoner of Azkaban) to the Minister of Magic, law, a cage, the castle, and Sirius Black; and the fourth (Goblet of Fire) to the magical ministry and a tent (probably for each team of the games), with Sirius again present as well as Lord (Voldemort). In the fifth book (Order of the Phoenix) Ginny and Sirius are again present, and death and Lord (Voldemort) become more frequent. In the sixth (Half-Blood Prince) Dumbledore's death is quite important, as well as trying to retrieve something from his memory (the Horcruxes). Lastly, the seventh book (Deathly Hallows) features words like death, Voldemort, wand, sword (of Gryffindor, used to destroy Horcruxes), jinx, and tent (in which the main characters spend days in the middle of the forest searching for Horcruxes).
To understand how similar the content of each book is to the first book of the saga, let's quantify how similar and different the previous sets of word frequencies are using the Pearson correlation coefficient (cor.test). Let's check book by book, always taking the first book as reference:
Chamber of Secrets
cor.test(data = frequency[frequency$book == "Chamber of Secrets",],
~ proportion + `Philosopher's Stone`)
##
## Pearson's product-moment correlation
##
## data: proportion and Philosopher's Stone
## t = 169.57, df = 3419, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9416928 0.9488229
## sample estimates:
## cor
## 0.9453708
cor: 0.9453708
Prisoner of Azkaban
cor.test(data = frequency[frequency$book == "Prisoner of Azkaban",],
~ proportion + `Philosopher's Stone`)
##
## Pearson's product-moment correlation
##
## data: proportion and Philosopher's Stone
## t = 174.95, df = 3540, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9432233 0.9500586
## sample estimates:
## cor
## 0.9467475
cor: 0.9467475
Half-Blood Prince
cor.test(data = frequency[frequency$book == "Half-Blood Prince",],
~ proportion + `Philosopher's Stone`)
##
## Pearson's product-moment correlation
##
## data: proportion and Philosopher's Stone
## t = 144.52, df = 3813, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9145315 0.9243378
## sample estimates:
## cor
## 0.9195777
cor: 0.9195777
Goblet of Fire
cor.test(data = frequency[frequency$book == "Goblet of Fire",],
~ proportion + `Philosopher's Stone`)
##
## Pearson's product-moment correlation
##
## data: proportion and Philosopher's Stone
## t = 178.36, df = 3905, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9402220 0.9470846
## sample estimates:
## cor
## 0.9437549
cor: 0.9437549
Order of the Phoenix
cor.test(data = frequency[frequency$book == "Order of the Phoenix",],
~ proportion + `Philosopher's Stone`)
##
## Pearson's product-moment correlation
##
## data: proportion and Philosopher's Stone
## t = 171.05, df = 4130, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9322311 0.9397805
## sample estimates:
## cor
## 0.9361135
cor: 0.9361135
Deathly Hallows
cor.test(data = frequency[frequency$book == "Deathly Hallows",],
~ proportion + `Philosopher's Stone`)
##
## Pearson's product-moment correlation
##
## data: proportion and Philosopher's Stone
## t = 132.75, df = 3884, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8993709 0.9107364
## sample estimates:
## cor
## 0.9052154
cor: 0.9052154
The saga is quite long, so it makes sense that the two books with the lowest correlation with the first one (although still extremely strong, thus keeping the essence of the saga) are the two last ones, perhaps because they gravitate more around finding the Horcruxes. Therefore, the two “most different” books in relation to the first one are the Half-Blood Prince and the Deathly Hallows.
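The six cor.test calls above could also be collapsed into a single pass with split() and sapply(). A sketch on a tiny invented frequency table that mimics the column layout of the real frequency tibble:

```r
library(dplyr)

# toy stand-in for the real `frequency` tibble (proportions are invented)
frequency_toy <- tibble(
  word = rep(c("harry", "wand", "magic", "castle"), times = 2),
  `Philosopher's Stone` = rep(c(0.040, 0.010, 0.008, 0.002), times = 2),
  book = rep(c("Chamber of Secrets", "Deathly Hallows"), each = 4),
  proportion = c(0.044, 0.011, 0.007, 0.003,
                 0.038, 0.012, 0.009, 0.001)
)

# one Pearson correlation per book against the first book's proportions
sapply(split(frequency_toy, frequency_toy$book),
       function(d) cor(d$proportion, d$`Philosopher's Stone`))
```

On the real data this yields the same estimates as the individual cor.test calls, just without the confidence intervals.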
In order to perform a sentiment analysis of the text I will use the three general-purpose sentiment lexicons from the tidytext package:
get_sentiments("afinn") # score between -5 and 5
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ℹ 2,467 more rows
get_sentiments("bing") #positive/negative
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
get_sentiments("nrc") #binary categorization of many sentiments
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ℹ 13,862 more rows
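The structural difference between the lexicons matters for how we aggregate later: AFINN gives a numeric value that must be summed, while Bing and NRC give categorical labels that must be counted. A minimal sketch with hand-made miniature lexicons (the entries are small stand-ins for what get_sentiments() returns):

```r
library(dplyr)

# miniature stand-ins for the real lexicons
afinn_mini <- tibble(word = c("abandon", "win"), value = c(-2, 4))
bing_mini  <- tibble(word = c("abandon", "win"),
                     sentiment = c("negative", "positive"))

tokens <- tibble(word = c("abandon", "win", "win"))

# AFINN-style: sum the numeric values to get a net score
afinn_net <- tokens |>
  inner_join(afinn_mini, by = "word") |>
  summarise(net = sum(value))

# Bing-style: count the labels, then subtract negative from positive
bing_counts <- tokens |>
  inner_join(bing_mini, by = "word") |>
  count(sentiment)

afinn_net
bing_counts
```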
I will work with the harry_potter_books dataframe, which has three columns: book, chapter and word, so it's already tokenised.
Besides magic, there are a lot of scary moments in the Harry Potter saga, so let's see what the most common fear words are in the first book (Philosopher's Stone):
#we set nrc lexicon to fear
nrc_fear <- get_sentiments("nrc") |>
filter(sentiment == "fear")
harry_potter_books |>
#we choose the book Philosopher's Stone
filter(book == "Philosopher's Stone") |>
#we combine both lists, NRC and Philosopher's Stone´s words
inner_join(nrc_fear) %>%
#we count the mentions of each word to find the most frequent
count(word, sort = TRUE) |>
#Filter them by frequency (only mentioned more than X times)
filter(n > 10) |>
#Reorder column word by number of mentions (most frequents on top)
mutate(word = reorder(word, n)) %>%
#Create the plot with x=n, y=word
ggplot(aes(n, word)) +
geom_col() +
labs(y = NULL)
The previous plot reveals many key fear-inducing elements, which align with the challenges and dangers Harry Potter faces throughout the book. Words like “fire,” “dragon,” “troll,” “giant,” and “snake” reflect the physical threats and dangerous situations encountered by Harry and his friends. These elements contribute to a sense of danger and suspense within the narrative.
Additionally, words like “bad,” “horrible,” “pain,” “terrible,” and “die” evoke emotional and psychological fears, highlighting the characters' inner struggles and the darker aspects of their experiences. The presence of words like “scar,” “fang,” and “mad” also references specific characters or events that evoke fear in the story, such as Voldemort's mark on Harry, dangerous creatures, or the malevolent intentions of antagonistic characters.
Let´s also see this on the last book (Deathly Hallows), which is much more dramatic and scary in that sense:
harry_potter_books |>
#we choose the book Deathly Hallows
filter(book == "Deathly Hallows") |>
#we combine both lists, NRC and Deathly Hallows' words
inner_join(nrc_fear) %>%
#we count the mentions of each word to find the most frequent
count(word, sort = TRUE) |>
#Filter them by frequency (only mentioned more than X times)
filter(n > 30) |>
#Reorder column word by number of mentions (most frequents on top)
mutate(word = reorder(word, n)) %>%
#Create the plot with x=n, y=word
ggplot(aes(n, word)) +
geom_col() +
labs(y = NULL)
This graph represents the book much more clearly.
The prominence of words like “death,” “darkness,” “pain,” “kill,” “mad,” and “fear” reflects the heightened stakes and pervasive sense of dread that permeates the book. The most used fear word is “death”: everything revolves around either Harry Potter's or Voldemort's possible death in the final battle.
Harry's scar also becomes more relevant in this book. References to “scar,” “snake,” and “curse” evoke past traumas and ongoing conflicts, serving as constant reminders of the dangers faced by Harry and his allies. The word “broken” appears frequently, suggesting the shattered state of the magical world and of the characters' spirits given the danger of the situation.
We can see harsher words in general, such as pain, curse, grave, shaking, kill, die…
It has become clear that the saga is filled with terror words, which are often negative, so to put this in context, we can analyze the top positive and negative words of the books with the Bing lexicon:
#create a data frame with word, sentiment and number of mentions
positive_negative_HP <- harry_potter_books |>
#we get sentiments from bing
inner_join(get_sentiments("bing")) |>
#count the number of mentions for each word
count(word, sentiment, sort = TRUE) |>
ungroup()
positive_negative_HP |>
group_by(sentiment) |>
#filter the top 20 most frequent words
slice_max(n, n = 20) |>
ungroup() |>
mutate(word = reorder(word, n)) |>
#plot the frequencies
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = TRUE) +
#organise the grid by sentiment and free Y
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
Top negative terms “dark” and “death” represent the omnipresent threat of Voldemort’s darkness and the seriousness of mortality, respectively, and depict the difficulties and perils that the characters must overcome. Words like “fell” and “hard” highlight the difficult journey and abrupt changes that the characters go through, while references to “moody” and “fudge” allude to the trauma and corruption that are a part of the magic world. “Scar” acts as an emotional remembrance of Harry Potter’s past and his association with the evil Voldemort.
On the other hand, the most positive words evoke the magic and happiness that soaks into the magic world. While terms like “gold” and “golden” denote victory, possibly of valued accomplishments and treasures within the narrative such as the Golden Snitch or the house´s trophy, words like “magic” and “magical” celebrate the series’ enchanting moments.
Nonetheless, I believe the negative words are more representative of negative sentiment than the positive words are of positive sentiment, since “top”, “people”, “looked”, “led” and “well” are often used in contexts without any positive or negative connotation.
We can add some of these words to the existing stopwords list that we can find directly in stop_words.
custom_stop_words <- bind_rows(tibble(word = c("top","people", "well", "looked", "led", "yeah"),
lexicon = c("custom")),
stop_words)
custom_stop_words
## # A tibble: 1,155 × 2
## word lexicon
## <chr> <chr>
## 1 top custom
## 2 people custom
## 3 well custom
## 4 looked custom
## 5 led custom
## 6 yeah custom
## 7 a SMART
## 8 a's SMART
## 9 able SMART
## 10 about SMART
## # ℹ 1,145 more rows
tidy_HP <- harry_potter_books |>
#we filter stopwords
anti_join(custom_stop_words)
positive_negative_HP <- tidy_HP |>
#we get sentiments from bing
inner_join(get_sentiments("bing")) |>
#count the number of mentions for each word
count(word, sentiment, sort = TRUE) |>
ungroup()
positive_negative_HP |>
group_by(sentiment) |>
slice_max(n, n = 15) |>
ungroup() |>
mutate(word = reorder(word, n)) |>
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = TRUE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(x = "Contribution to sentiment",
y = NULL)
That looks a bit better.
Now maybe we can better visualize word frequencies with wordclouds:
library(wordcloud)
#pick 8 colours from the 'BuPu' brewer palette
colors <- brewer.pal(8, 'BuPu')
tidy_HP |>
#count words (the custom stopwords were already removed when building tidy_HP)
count(word) |>
#draw the wordcloud, passing the palette so colour follows frequency
with(wordcloud(word, n, max.words = 80, colors = colors))
In the previous wordcloud the most frequent words of the saga are represented by their size.
Now we can go a step further and compare the wordclouds for positive and negative words:
library(reshape2)
tidy_HP |>
#we get sentiments
inner_join(get_sentiments("bing")) %>%
#we count word mentions
count(word, sentiment, sort = TRUE) %>%
#reshape into a word × sentiment matrix (one column per sentiment)
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
#we paint two wordclouds in one using two different colors
comparison.cloud(colors = c("deepskyblue4", "deeppink4"),
max.words = 90)
We can also examine how sentiment changes throughout each book. To do so we need some unit of analysis, which could be 80 lines of text. Nonetheless, since all the books were tokenised straight away, I don't have line numbers, so let's take chunks of 800 words as the unit of analysis instead:
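The chunking below relies on integer division: wordcount %/% 800 maps every run of 800 consecutive word positions to the same index, so each chunk can be counted as one unit. A one-line base-R illustration:

```r
# positions 1..799 fall in chunk 0, 800..1599 in chunk 1, and so on
positions <- c(1, 799, 800, 1599, 1600)
positions %/% 800
# → 0 0 1 1 2
```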
harry_potter_books <- harry_potter_books |>
mutate(wordcount = row_number())#create the wordcount as the row number
harry_potter_sentiment <- harry_potter_books %>%
#find the sentiment for each word using bing
inner_join(get_sentiments("bing")) %>%
#divide each book in chunks of 800 words
count(book, index = wordcount %/% 800, sentiment) %>%
#we write positive and negative in different columns
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
#we substract positive minus negative to find a net sentiment
mutate(sentiment = positive - negative)
harry_potter_sentiment
## # A tibble: 518 × 5
## book index negative positive sentiment
## <fct> <dbl> <int> <int> <int>
## 1 Deathly Hallows 419 5 4 -1
## 2 Deathly Hallows 420 70 33 -37
## 3 Deathly Hallows 421 93 36 -57
## 4 Deathly Hallows 422 68 72 4
## 5 Deathly Hallows 423 69 43 -26
## 6 Deathly Hallows 424 63 37 -26
## 7 Deathly Hallows 425 67 50 -17
## 8 Deathly Hallows 426 86 26 -60
## 9 Deathly Hallows 427 108 19 -89
## 10 Deathly Hallows 428 88 22 -66
## # ℹ 508 more rows
And now we can plot it:
#create the plot with x = index (chunks) and y = net sentiment
ggplot(harry_potter_sentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = TRUE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
Most of them are quite negative, especially the last book of the saga. Nevertheless, this is what I already expected, given that it is a quite dramatic and scary saga, always revolving around death and black magic. It stands out that the book with the most dramatic ending is the Half-Blood Prince, at least in comparison to the Philosopher's Stone, which seems more vanilla next to it.
These sentiment analysis plots just displayed can be put together to get a better picture of the sentimental evolution of the narrative throughout the books of the saga:
# Load necessary libraries
library(tidyverse)
afinn <- get_sentiments("afinn")
# Compute sentiment scores for each word in the dataset
harry_potter_sentiment <- harry_potter_books %>%
inner_join(afinn)
# Group by book and chapter, then sum up sentiment scores for each chapter
sentiment_per_chapter <- harry_potter_sentiment %>%
mutate(series_chapter = cumsum(c(1, diff(chapter) != 0)))
sentiment_per_chapter <- sentiment_per_chapter|>
group_by(book, series_chapter) |>
summarise(total_sentiment = sum(value))|>
ungroup()
# Plot sentiment scores per chapter per book
ggplot(sentiment_per_chapter, aes(x = series_chapter, y = total_sentiment, group = book, color = book)) +
geom_line(linewidth = 1.2) + # linewidth replaces the deprecated size aesthetic for lines
geom_hline(yintercept = 0, linetype = "dashed", color = "black") + # Add horizontal line at y = 0
labs(title = "AFINN Sentiment Score per Chapter per Book in Harry Potter Saga",
x = "Series Chapter",
y = "Total Sentiment Score") +
theme_minimal() +
theme(legend.position = "bottom")
As we saw before, all the books are quite negative, although in the Half-Blood Prince a few “happy moments” stand out. At the very end, nonetheless, there is a happy ending, indicated by the final stretch of the last line.
Lexicons are not infallible; each carries its own subtle biases that can influence sentiment attribution. Hence, it becomes quite relevant to examine how the three previously defined lexicons categorize the sentiment of one of the books.
In this analysis, I will delve into one of the most gripping books of the Harry Potter saga, the Half-Blood Prince, which stands out for its dramatic conclusion:
half_blood_prince <- harry_potter_books |>
filter(book == "Half-Blood Prince")
half_blood_prince
## # A tibble: 63,098 × 4
## book chapter word wordcount
## <fct> <int> <chr> <int>
## 1 Half-Blood Prince 1 nearing 272835
## 2 Half-Blood Prince 1 midnight 272836
## 3 Half-Blood Prince 1 prime 272837
## 4 Half-Blood Prince 1 minister 272838
## 5 Half-Blood Prince 1 sitting 272839
## 6 Half-Blood Prince 1 office 272840
## 7 Half-Blood Prince 1 reading 272841
## 8 Half-Blood Prince 1 memo 272842
## 9 Half-Blood Prince 1 slipping 272843
## 10 Half-Blood Prince 1 brain 272844
## # ℹ 63,088 more rows
#for AFINN we need to summarise quantities to get net sentiment.
afinn <- half_blood_prince |>
inner_join(get_sentiments("afinn")) |>
#get the sentiment for each chunk of 800 words
group_by(index = wordcount %/% 800) |>
summarise(sentiment = sum(value)) |>
mutate(method = "AFINN")
#for Bing and NRC we can do it in one step.
bing_and_nrc <- bind_rows(
#Bing
half_blood_prince |>
#we get sentiments from bing
inner_join(get_sentiments("bing")) |>
#we create the column for bing
mutate(method = "Bing et al."),
#NRC
half_blood_prince |>
#we get sentiment from nrc
inner_join(get_sentiments("nrc") |>
#we filter just sentiment, not emotions
filter(sentiment %in% c("positive",
"negative"))
) |>
#we create the column for nrc
mutate(method = "NRC")) |>
#we divide in chunks of 800 words
count(method, index = wordcount %/% 800, sentiment) %>%
#we write positive and negative in different columns
pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
#we extract net sentiment by substraction
mutate(sentiment = positive - negative)
Let's look at the evolution given this positive-negative scale:
afinn
## # A tibble: 79 × 3
## index sentiment method
## <dbl> <dbl> <chr>
## 1 341 -55 AFINN
## 2 342 -121 AFINN
## 3 343 -7 AFINN
## 4 344 -37 AFINN
## 5 345 -34 AFINN
## 6 346 -21 AFINN
## 7 347 -56 AFINN
## 8 348 -17 AFINN
## 9 349 30 AFINN
## 10 350 0 AFINN
## # ℹ 69 more rows
bing_and_nrc
## # A tibble: 158 × 5
## method index negative positive sentiment
## <chr> <dbl> <int> <int> <int>
## 1 Bing et al. 341 105 31 -74
## 2 Bing et al. 342 129 31 -98
## 3 Bing et al. 343 85 33 -52
## 4 Bing et al. 344 126 48 -78
## 5 Bing et al. 345 99 46 -53
## 6 Bing et al. 346 66 36 -30
## 7 Bing et al. 347 77 30 -47
## 8 Bing et al. 348 66 31 -35
## 9 Bing et al. 349 69 53 -16
## 10 Bing et al. 350 60 38 -22
## # ℹ 148 more rows
And finally, let’s bind the three of them together and visualize them in a plot:
#bind the three of them
bind_rows(afinn,
bing_and_nrc) %>%
#make the plot with x=index (chunks), y=sentiment and fill by lexicon (method)
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
The Bing lexicon is much more negative in general and AFINN is the most positive. All three detect a dramatic ending, but according to the Bing lexicon the decline is gradual, whereas the NRC represents it as something sudden and very unexpected.
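One caveat (my own addition, not part of the comparison above): each lexicon matches a different number of words per chunk, so the raw net-sentiment bars are not on a directly comparable scale. A sketch of one way to normalise, dividing the net sentiment by the number of matched words, using invented chunk counts shaped like bing_and_nrc:

```r
library(dplyr)

# toy chunk-level counts in the same shape as bing_and_nrc (numbers invented)
chunks <- tibble(method   = c("Bing et al.", "NRC"),
                 index    = c(341, 341),
                 negative = c(105, 60),
                 positive = c(31, 45))

normalised <- chunks |>
  mutate(net          = positive - negative,
         matched      = positive + negative,
         net_per_word = net / matched) # comparable across lexicons

normalised
```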
Given that this is a whole saga, I would like to know: which chapter has the highest proportion of negative words (using Bing)?
We have already seen the sentiment evolution of the saga in the previous plots, so let's give it some sense and context by seeking the most negative chapter of each book.
First we filter the negative words from bing.
#filter negative words from Bing
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
bingnegative
## # A tibble: 4,781 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 4,771 more rows
Second, we need to create a dataframe with the number of words per chapter:
# make a dataframe (wordcounts) with number of words per chapter
wordcounts <- harry_potter_books |>
group_by(book, chapter) |>
summarize(words = n())
wordcounts
## # A tibble: 200 × 3
## # Groups: book [7]
## book chapter words
## <fct> <int> <int>
## 1 Deathly Hallows 1 1237
## 2 Deathly Hallows 2 1598
## 3 Deathly Hallows 3 1256
## 4 Deathly Hallows 4 2118
## 5 Deathly Hallows 5 2195
## 6 Deathly Hallows 6 2255
## 7 Deathly Hallows 7 2401
## 8 Deathly Hallows 8 2487
## 9 Deathly Hallows 9 1521
## 10 Deathly Hallows 10 2451
## # ℹ 190 more rows
Third, create the ratio of the number of negative words to total words per chapter and filter to get the highest:
#find the number of negative words by chapter and divide by the total words in chapter
harry_potter_books |>
#semi_join: returns all words in books with a match in bingnegative
semi_join(bingnegative) |>
#group by book and chapter to summarize how many negative words by chapter
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
#left_join keeps all words in wordcounts and makes a dataframe
left_join(wordcounts, by = c("book", "chapter")) %>%
#create a column in the dataframe with the ratio
mutate(ratio = negativewords/words) %>%
#we don't want chapters 0 because they're just title and author
filter(chapter != 0) %>%
#we select the highest ratio per book
slice_max(ratio, n = 1) %>%
ungroup()
## # A tibble: 7 × 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Deathly Hallows 18 170 1244 0.137
## 2 Half-Blood Prince 1 259 1836 0.141
## 3 Order of the Phoenix 37 300 2646 0.113
## 4 Goblet of Fire 1 184 1470 0.125
## 5 Prisoner of Azkaban 17 181 1664 0.109
## 6 Chamber of Secrets 10 221 2139 0.103
## 7 Philosopher's Stone 17 213 1870 0.114
It stands out that in Goblet of Fire (and also Half-Blood Prince) the most negative chapter is the first one, whereas for the rest of the books it falls either in the middle of the story or towards the end. In absolute terms, the chapter with the most negative words is chapter 37 of Order of the Phoenix, with 300.
I will plot the most frequent words by book, but in order to make a meaningful analysis it's best to also filter out the names of the main characters, which appear in almost every book (I will add them to the custom stop words tibble):
custom_stop_words <- bind_rows(tibble(word = c("top", "well", "led", "harry", "ron", "hermione", "weasley", "dumbledore", "professor", "malfoy", "potter", "snape", "harry's", "people", "looked", "yeah" ),
lexicon = c("custom")),
stop_words)
custom_stop_words
## # A tibble: 1,165 × 2
## word lexicon
## <chr> <chr>
## 1 top custom
## 2 well custom
## 3 led custom
## 4 harry custom
## 5 ron custom
## 6 hermione custom
## 7 weasley custom
## 8 dumbledore custom
## 9 professor custom
## 10 malfoy custom
## # ℹ 1,155 more rows
harry_potter_books |>
# delete stopwords
anti_join(custom_stop_words) |>
# summarize count per word per book
count(book, word) |>
# get top 15 words per book
group_by(book) |>
slice_max(order_by = n, n = 15) |>
mutate(word = reorder_within(word, n, book)) |>
# create barplot
ggplot(aes(x = word, y = n, fill = book)) +
geom_col(color = "black") +
scale_x_reordered() +
labs(
title = "Most frequent words in Harry Potter",
x = NULL,
y = "Word count"
) +
facet_wrap(facets = vars(book), scales = "free") +
coord_flip() +
theme(legend.position = "none")
Now the most frequent words are definitely far more representative of each book.
Wordclouds can also be an excellent tool to get a picture of the context and the main events of each book, so let's plot one for each:
book_titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban","Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince","Deathly Hallows")
create_wordcloud_grid <- function(book_titles, tidy_data) {
# Initialize an empty list to store word clouds
wordcloud_list <- list()
# Loop through each book title
for (title in book_titles) {
# Filter data for the current book title
filtered_data <- tidy_data %>%
filter(book == title) %>%
anti_join(custom_stop_words) %>%
count(word)
# Create word cloud for the current book title
wordcloud_list[[title]] <- wordcloud(words = filtered_data$word,
freq = filtered_data$n,#to plot the size
max.words = 80,
main = title)
}
}
# Call the function
create_wordcloud_grid(book_titles, tidy_HP)
As mentioned, names are the most frequent words.
We can create a dataframe with this information:
#count the occurrences of each word in each book
book_words <- harry_potter_books |>
count(book, word, sort = TRUE)
book_words
## # A tibble: 63,651 × 3
## book word n
## <fct> <chr> <int>
## 1 Order of the Phoenix harry 3730
## 2 Goblet of Fire harry 2936
## 3 Deathly Hallows harry 2770
## 4 Half-Blood Prince harry 2581
## 5 Prisoner of Azkaban harry 1824
## 6 Chamber of Secrets harry 1503
## 7 Order of the Phoenix hermione 1220
## 8 Philosopher's Stone harry 1213
## 9 Order of the Phoenix ron 1189
## 10 Deathly Hallows hermione 1077
## # ℹ 63,641 more rows
total_words <- book_words |>
#we group by books to sum all the totals in the n column of book_words
group_by(book) |>
#we create a column called total with the total of words by book
summarize(total = sum(n))
total_words
## # A tibble: 7 × 2
## book total
## <fct> <int>
## 1 Deathly Hallows 73406
## 2 Half-Blood Prince 63098
## 3 Order of the Phoenix 96777
## 4 Goblet of Fire 72663
## 5 Prisoner of Azkaban 41188
## 6 Chamber of Secrets 33621
## 7 Philosopher's Stone 28585
#we use left_join because we need the join to keep all rows in book_words, regardless of repeating rows
book_words <- left_join(book_words, total_words)
book_words
## # A tibble: 63,651 × 4
## book word n total
## <fct> <chr> <int> <int>
## 1 Order of the Phoenix harry 3730 96777
## 2 Goblet of Fire harry 2936 72663
## 3 Deathly Hallows harry 2770 73406
## 4 Half-Blood Prince harry 2581 63098
## 5 Prisoner of Azkaban harry 1824 41188
## 6 Chamber of Secrets harry 1503 33621
## 7 Order of the Phoenix hermione 1220 96777
## 8 Philosopher's Stone harry 1213 28585
## 9 Order of the Phoenix ron 1189 96777
## 10 Deathly Hallows hermione 1077 73406
## # ℹ 63,641 more rows
With all this information we can perform the term frequency, which is the number of times a word appears in a novel divided by the total number of terms (words) in that novel.
book_words <- book_words |>
#we add a column for term_frequency in each novel
mutate(term_frequency = n/total)
book_words
## # A tibble: 63,651 × 5
## book word n total term_frequency
## <fct> <chr> <int> <int> <dbl>
## 1 Order of the Phoenix harry 3730 96777 0.0385
## 2 Goblet of Fire harry 2936 72663 0.0404
## 3 Deathly Hallows harry 2770 73406 0.0377
## 4 Half-Blood Prince harry 2581 63098 0.0409
## 5 Prisoner of Azkaban harry 1824 41188 0.0443
## 6 Chamber of Secrets harry 1503 33621 0.0447
## 7 Order of the Phoenix hermione 1220 96777 0.0126
## 8 Philosopher's Stone harry 1213 28585 0.0424
## 9 Order of the Phoenix ron 1189 96777 0.0123
## 10 Deathly Hallows hermione 1077 73406 0.0147
## # ℹ 63,641 more rows
#we plot the distribution of term frequency on the x axis
ggplot(book_words, aes(term_frequency)) +
#we create the bars histogram
geom_histogram(show.legend = TRUE) +
#we set the limit for the term frequency in the x axis
xlim(NA, 0.0009)
We have a long tail distribution with many words with very low frequencies and fewer with greater frequencies.
#we calculate the distribution and put it in the x axis, filling by book
ggplot(book_words, aes(term_frequency, fill = book)) +
#we create the bars histogram
geom_histogram(show.legend = TRUE) +
#we set the limit for the term frequency in the x axis
xlim(NA, 0.0002) +
#plot settings
facet_wrap(~book, ncol = 2, scales = "free_y")
Philosopher's Stone, for example, has many words grouped at similar frequencies in comparison to the other books of the saga; together with Chamber of Secrets, it also has the fewest "rare" (less frequent) words.
Zipf's law (after George Zipf) states that the frequency of a word in a text is inversely proportional to its rank: the lower a word's frequency, the higher its rank number, so the most frequent word appears roughly twice as often as the second most frequent, three times as often as the third, and so on.
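Before applying this to the books, Zipf's law can be illustrated on a small toy example (invented counts, not the Harry Potter data): if frequency is inversely proportional to rank, the product rank × term frequency stays roughly constant.

```r
library(dplyr)

# Invented word counts that roughly follow Zipf's law: f(r) ~ C / r
toy <- tibble(
  word = c("the", "of", "and", "to", "a"),
  n    = c(1000, 500, 333, 250, 200)
) |>
  arrange(desc(n)) |>
  mutate(rank = row_number(),
         term_frequency = n / sum(n),
         # under Zipf's law this product is approximately constant
         rank_x_freq = rank * term_frequency)

toy
```

On real text the relationship only holds approximately, which is exactly the kind of deviation explored later with the fitted regression line.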
Thus it can be interesting to add to our dataframe the ranking of the words in descending order by their frequency in each book.
freq_by_rank <- book_words |>
group_by(book) |>
#we create the column for the rank with row_number by book
mutate(rank = row_number()) |>
ungroup()
freq_by_rank
## # A tibble: 63,651 × 6
## book word n total term_frequency rank
## <fct> <chr> <int> <int> <dbl> <int>
## 1 Order of the Phoenix harry 3730 96777 0.0385 1
## 2 Goblet of Fire harry 2936 72663 0.0404 1
## 3 Deathly Hallows harry 2770 73406 0.0377 1
## 4 Half-Blood Prince harry 2581 63098 0.0409 1
## 5 Prisoner of Azkaban harry 1824 41188 0.0443 1
## 6 Chamber of Secrets harry 1503 33621 0.0447 1
## 7 Order of the Phoenix hermione 1220 96777 0.0126 2
## 8 Philosopher's Stone harry 1213 28585 0.0424 1
## 9 Order of the Phoenix ron 1189 96777 0.0123 3
## 10 Deathly Hallows hermione 1077 73406 0.0147 2
## # ℹ 63,641 more rows
Let's visualize Zipf's law in the Harry Potter collection:
freq_by_rank |>
ggplot(aes(rank, term_frequency, color = book)) +
#plot settings
geom_line(linewidth = 1.1, alpha = 0.8, show.legend = TRUE)
Philosopher's Stone and Chamber of Secrets are the books whose curves stop at the lowest ranks, since they have the smallest vocabularies. This is better visualized on logarithmic scales.
freq_by_rank |>
ggplot(aes(rank, term_frequency, color = book)) +
geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10()
As we can see, all the books follow more or less the same tendency in their use of words. Nonetheless, books like Order of the Phoenix and Goblet of Fire stand out in the tail of highest ranks (partly because, being longer, they contain more distinct words).
Measuring deviation:
If we break the previous plot in three sections that we assume to be three different usages of language, we can see that the middle section is the most stable one.
Let's find the coefficients that define the relationship between term frequency and rank in this section, and plot the deviation:
#we set the section in a variable called rank_subset
rank_subset <- freq_by_rank |>
filter(rank < 500,
rank > 10)
#we use the linear model function (lm) to find numeric coefficients of relationship between tf and rank
lm(log10(term_frequency) ~ log10(rank), data = rank_subset)
##
## Call:
## lm(formula = log10(term_frequency) ~ log10(rank), data = rank_subset)
##
## Coefficients:
## (Intercept) log10(rank)
## -1.7215 -0.6225
The fitted coefficients are an intercept of -1.7215 and a slope of -0.6225 for log10(rank).
This line now can be added to our plot to see the deviation from the standard use of language in the books
freq_by_rank |>
ggplot(aes(rank, term_frequency, color = book)) +
#we add a line in the plot with the two coefficients we have found
geom_abline(intercept = -1.7215, slope = -0.6225,
color = "gray50", linetype = 2) +
geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10()
Deviation in the first section (the most common words) means J.K. Rowling uses a lower percentage of the most common words than many collections of language.
Deviation in the third section (the rarest words) means J.K. Rowling uses a higher percentage of rare words than many collections of language.
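As an extension, the same log-log regression could be fitted within each book to compare Zipf exponents across the saga. A minimal sketch on invented data (two made-up "books"; the real analysis would group `freq_by_rank` by book):

```r
library(dplyr)

# Toy rank/frequency data for two invented "books":
# book A decays like 1/rank, book B decays faster
toy <- tibble(
  book = rep(c("A", "B"), each = 5),
  rank = rep(1:5, times = 2),
  term_frequency = c(0.40, 0.20, 0.13, 0.10, 0.08,
                     0.50, 0.18, 0.09, 0.06, 0.04)
)

# Fit log10(term_frequency) ~ log10(rank) per book and keep the slope
slopes <- toy |>
  group_by(book) |>
  summarize(slope = coef(lm(log10(term_frequency) ~ log10(rank)))[2])

slopes
```

A slope close to -1 indicates a close fit to the classic Zipf exponent; a steeper slope indicates frequencies falling off faster with rank.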
The bind_tf_idf() function in the tidytext package takes a tidy text data frame as input, with one row per token (term) per document. It only needs three columns: book (the document), word (the term) and n (the frequency).
HP_tf_idf <- book_words |>
#create tf-idf column
bind_tf_idf(word, book, n)
HP_tf_idf
## # A tibble: 63,651 × 8
## book word n total term_frequency tf idf tf_idf
## <fct> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Order of the Phoenix harry 3730 96777 0.0385 0.0385 0 0
## 2 Goblet of Fire harry 2936 72663 0.0404 0.0404 0 0
## 3 Deathly Hallows harry 2770 73406 0.0377 0.0377 0 0
## 4 Half-Blood Prince harry 2581 63098 0.0409 0.0409 0 0
## 5 Prisoner of Azkaban harry 1824 41188 0.0443 0.0443 0 0
## 6 Chamber of Secrets harry 1503 33621 0.0447 0.0447 0 0
## 7 Order of the Phoenix hermione 1220 96777 0.0126 0.0126 0 0
## 8 Philosopher's Stone harry 1213 28585 0.0424 0.0424 0 0
## 9 Order of the Phoenix ron 1189 96777 0.0123 0.0123 0 0
## 10 Deathly Hallows hermione 1077 73406 0.0147 0.0147 0 0
## # ℹ 63,641 more rows
At the top we find words with TF-IDF at or near zero: these are words that occur in many or all of the documents in the collection, so their IDF shrinks towards zero.
As previously noted and as suggested by Zipf's law, when a word appears in numerous documents it becomes less distinctive to any single one, so TF-IDF down-weights it; words concentrated in fewer documents receive higher scores. We will therefore be interested in the words with the highest TF-IDF:
HP_tf_idf |>
#we exclude the total column which is not necessary now
select(-total) |>
#we arrange by tf-idf in descending order
arrange(desc(tf_idf))
## # A tibble: 63,651 × 7
## book word n term_frequency tf idf tf_idf
## <fct> <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Half-Blood Prince slughorn 335 0.00531 0.00531 1.25 0.00665
## 2 Order of the Phoenix umbridge 496 0.00513 0.00513 0.847 0.00434
## 3 Goblet of Fire bagman 208 0.00286 0.00286 1.25 0.00359
## 4 Chamber of Secrets lockhart 197 0.00586 0.00586 0.560 0.00328
## 5 Prisoner of Azkaban lupin 369 0.00896 0.00896 0.336 0.00301
## 6 Goblet of Fire winky 145 0.00200 0.00200 1.25 0.00250
## 7 Goblet of Fire champions 84 0.00116 0.00116 1.95 0.00225
## 8 Deathly Hallows xenophilius 79 0.00108 0.00108 1.95 0.00209
## 9 Half-Blood Prince mclaggen 65 0.00103 0.00103 1.95 0.00200
## 10 Deathly Hallows griphook 117 0.00159 0.00159 1.25 0.00200
## # ℹ 63,641 more rows
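The idf values in the table above can be checked by hand: bind_tf_idf() computes idf as the natural log of the number of documents divided by the number of documents containing the word, and here the collection has 7 books. A quick sketch:

```r
# idf = ln(number of documents / number of documents containing the term)
idf <- function(books_containing, n_books = 7) {
  log(n_books / books_containing)
}

round(idf(1), 2)  # a word unique to one book (e.g. "xenophilius") -> 1.95
round(idf(2), 2)  # a word appearing in two books (e.g. "slughorn") -> 1.25
idf(7)            # a word in every book (e.g. "harry") -> 0
```

These match the idf column above: 1.95, 1.25, and 0 for words like "harry" that appear in all seven books.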
Let's visualize the words with the highest tf-idf in each book:
library(forcats)
HP_tf_idf |>
group_by(book) |>
#choose maximum number of words
slice_max(tf_idf, n = 20) |>
ungroup() |>
ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 3, scales = "free") +
labs(x = "tf-idf", y = NULL)
In “Philosopher’s Stone”, words such as “Quirrell,” “Flamel,” and “Nicolas” highlight key characters. References to “Troll,” “Flute,” and “Stone’s” evoke memorable plot points, such as the troll incident in the girls’ bathroom and the enchantments protecting the Philosopher’s Stone. Moreover, terms like “Ollivander” and “Remembrall” evoke the wand shop and the magical objects introduced in the book.
Moving to “Chamber of Secrets”, words such as “Lockhart,” “Dobby,” and “Myrtle” refer to important characters. The presence of “Riddle”, “Diary” and “Basilisk” signifies the central mystery surrounding the Chamber of Secrets and Tom Riddle’s memory. Additionally, references to “Mandrakes” and “Aragog” evoke the magical creatures present in the school.
In “Prisoner of Azkaban”, words like “Lupin,” “Pettigrew,” and “Marge” evoke the central mysteries and characters of the book. The presence of “Black” and “Scabbers” marks the revelation of Sirius Black’s innocence and the true identity of Ron’s pet rat, Scabbers. Additionally, terms like “Dementors” and “Expecto” refer to the charm (Expecto Patronum) used against the dementors.
For “Goblet of Fire”, distinctive words like “Bagman,” “Winky,” and “Champions” are related to the Triwizard Tournament. Meanwhile, references to “Moody” and “Cedric” evoke pivotal events and characters central to the book’s climax, emphasizing themes of trust and betrayal.
In “Order of the Phoenix”, words such as “Umbridge,” “Defence,” and “Luna” reflect the tumultuous events at Hogwarts School and the rise of Dolores Umbridge to power. The presence of “Sirius” and “Tonks” signifies the involvement of key members of the Order of the Phoenix and the challenges they face in resisting Voldemort’s return. Additionally, terms like “Prophecy” and “Eaters” hint at the escalating conflict.
In “Half-Blood Prince”, “Slughorn,” “McLaggen,” and “Morfin” are key characters. Professor Slughorn’s role in divulging crucial information to Harry, as well as the introduction of characters like Cormac McLaggen and Morfin Gaunt, contribute to the development and revelations central to the book’s narrative. References to “Felix Felicis” and “Prophecy” allude to pivotal plot points of the book.
Lastly, in “Deathly Hallows”, words like “Xenophilius,” “Griphook,” “Hallows,” and “Horcrux” reflect the intense focus on the quest for the Deathly Hallows and the hunt for Voldemort’s Horcruxes, which drive much of the narrative tension. Characters such as “Luna” and “Kreacher” are significant players in this book. Additionally, names like “Greyback” and “Bellatrix” hint at the menacing presence of Death Eaters and the looming threat of darkness throughout the book.
Another good way to understand the context and content of each book is using N-grams. N-grams are consecutive sequences of words, where n defines the number of words composing a token. Here we will start working with them.
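As a quick illustration of how unnest_tokens() produces n-grams (a toy sentence, not the book text):

```r
library(tibble)
library(tidytext)

toy <- tibble(text = "the boy who lived")

# token = "ngrams" with n = 2 yields overlapping pairs of consecutive words
toy |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
```

This returns the bigrams "the boy", "boy who" and "who lived": every word except the first and last appears in two tokens, which is why the bigram dataset below has almost as many rows as the word-level one.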
I will use the same function as at the beginning of the notebook, but tokenizing by n-grams (bigrams in this case):
# Creating vectors of book titles and corresponding texts:
book_titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
"Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince", "Deathly Hallows")
#I have to first redefine the books as their format has been changed
philosophers_stone <- harrypotter::philosophers_stone
chamber_of_secrets <- harrypotter::chamber_of_secrets
prisoner_of_azkaban <- harrypotter::prisoner_of_azkaban
goblet_of_fire <- harrypotter::goblet_of_fire
order_of_the_phoenix <- harrypotter::order_of_the_phoenix
half_blood_prince <- harrypotter::half_blood_prince
deathly_hallows <- harrypotter::deathly_hallows
# Creating vectors of book texts
books <- list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban,
goblet_of_fire, order_of_the_phoenix, half_blood_prince, deathly_hallows)
# Creating an empty tibble:
harry_potter_bigrams <- tibble()
# Looping through the book titles
for (i in seq_along(book_titles)) {
# Saving the cleaned text in the "clean" dataset
# Each chapter is represented as a different element inside the vector, so the way to access it is book[chapter]
clean <- tibble(chapter = seq_along(books[[i]]),
text = books[[i]]) |>
# Tokenize the text into bigrams
unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
# Assign book titles
mutate(book = book_titles[i]) |>
# First column is the book
select(book, everything())
# Stack the books together in rows
harry_potter_bigrams <- rbind(harry_potter_bigrams, clean)
}
# Set levels for the books according to their order of publication
harry_potter_bigrams$book <- factor(harry_potter_bigrams$book, levels = rev(book_titles))
# Eliminate the unnecessary dataset
rm(clean)
# This is our final dataset
harry_potter_bigrams
## # A tibble: 1,089,186 × 3
## book chapter bigram
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 the boy
## 2 Philosopher's Stone 1 boy who
## 3 Philosopher's Stone 1 who lived
## 4 Philosopher's Stone 1 lived mr
## 5 Philosopher's Stone 1 mr and
## 6 Philosopher's Stone 1 and mrs
## 7 Philosopher's Stone 1 mrs dursley
## 8 Philosopher's Stone 1 dursley of
## 9 Philosopher's Stone 1 of number
## 10 Philosopher's Stone 1 number four
## # ℹ 1,089,176 more rows
Each token now is a bigram, not a word.
Now we can observe which pairs of words are the most frequent. However, we first need to filter out stopwords to obtain valuable information.
To do so, we separate each bigram into two columns, filter the stopwords, and then count the frequencies:
bigrams_separated <- harry_potter_bigrams |>
#we separate each bigram in two columns, word1 and word2
separate(bigram, c("word1", "word2"), sep = " ")
#we filter all words included in the word column in stop_words
bigrams_filtered <- bigrams_separated |>
filter(!word1 %in% stop_words$word) |>
filter(!word2 %in% stop_words$word)
# new bigram counts:
bigram_counts <- bigrams_filtered |>
#how many times in the bigrams_filtered they appear together
count(word1, word2, sort = TRUE)
bigram_counts
## # A tibble: 89,120 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 professor mcgonagall 578
## 2 uncle vernon 386
## 3 harry potter 349
## 4 death eaters 346
## 5 harry looked 316
## 6 harry ron 302
## 7 aunt petunia 206
## 8 invisibility cloak 192
## 9 professor trelawney 177
## 10 dark arts 176
## # ℹ 89,110 more rows
As we can see, the most frequent bigrams are proper nouns, name and surname combinations or title/relationship-name combinations.
Thus, by filtering these words out as well, perhaps we can get more useful information:
custom_stop_words <- bind_rows(tibble(word = c("top", "well", "led", "harry", "ron", "hermione", "weasley", "dumbledore", "professor", "malfoy", "potter", "snape", "harry's", "uncle","aunt", "madam", "madame", "voldemort", "lord", "yeah" ),
lexicon = c("custom")),
stop_words)
custom_stop_words
## # A tibble: 1,169 × 2
## word lexicon
## <chr> <chr>
## 1 top custom
## 2 well custom
## 3 led custom
## 4 harry custom
## 5 ron custom
## 6 hermione custom
## 7 weasley custom
## 8 dumbledore custom
## 9 professor custom
## 10 malfoy custom
## # ℹ 1,159 more rows
Now we filter our bigrams with the custom stopwords:
#we filter all words included our custom stop_words df
bigrams_filtered1 <- bigrams_separated |>
filter(!word1 %in% custom_stop_words$word) |>
filter(!word2 %in% custom_stop_words$word)
# new bigram counts:
bigram_counts1 <- bigrams_filtered1 |>
#how many times in the bigrams_filtered they appear together
count(word1, word2, sort = TRUE)
bigram_counts1
## # A tibble: 76,306 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 death eaters 346
## 2 invisibility cloak 192
## 3 dark arts 176
## 4 death eater 164
## 5 entrance hall 145
## 6 daily prophet 125
## 7 mad eye 116
## 8 hospital wing 107
## 9 prime minister 94
## 10 house elf 93
## # ℹ 76,296 more rows
We can see that the Death Eaters are quite prominent, together with the invisibility cloak and the Dark Arts. Lastly, Mad-Eye (Moody) appears to be an important secondary character throughout the series.
After the bigrams are filtered, we can unite them again:
bigrams_united <- bigrams_filtered |>
unite(bigram, word1, word2, sep = " ")
bigrams_united
## # A tibble: 137,629 × 3
## book chapter bigram
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 privet drive
## 2 Philosopher's Stone 1 perfectly normal
## 3 Philosopher's Stone 1 firm called
## 4 Philosopher's Stone 1 called grunnings
## 5 Philosopher's Stone 1 usual amount
## 6 Philosopher's Stone 1 time craning
## 7 Philosopher's Stone 1 garden fences
## 8 Philosopher's Stone 1 fences spying
## 9 Philosopher's Stone 1 son called
## 10 Philosopher's Stone 1 called dudley
## # ℹ 137,619 more rows
Also we can visualize them:
library(ggraph)
set.seed(2017)
# we use the dataframe with the bigram counts
bigram_counts
## # A tibble: 89,120 × 3
## word1 word2 n
## <chr> <chr> <int>
## 1 professor mcgonagall 578
## 2 uncle vernon 386
## 3 harry potter 349
## 4 death eaters 346
## 5 harry looked 316
## 6 harry ron 302
## 7 aunt petunia 206
## 8 invisibility cloak 192
## 9 professor trelawney 177
## 10 dark arts 176
## # ℹ 89,110 more rows
# keep only combinations appearing more than 20 times
bigram_for_graph <- bigram_counts |>
filter(n > 20) |>
graph_from_data_frame()
bigram_for_graph
## IGRAPH 2b42bf7 DN-- 341 291 --
## + attr: name (v/c), n (e/n)
## + edges from 2b42bf7 (vertex names):
## [1] professor ->mcgonagall uncle ->vernon harry ->potter
## [4] death ->eaters harry ->looked harry ->ron
## [7] aunt ->petunia invisibility->cloak professor ->trelawney
## [10] dark ->arts professor ->umbridge death ->eater
## [13] entrance ->hall madam ->pomfrey dark ->lord
## [16] professor ->dumbledore daily ->prophet lord ->voldemort
## [19] harry ->heard professor ->lupin mad ->eye
## [22] hospital ->wing draco ->malfoy harry ->harry
## + ... omitted several edges
# to introduce settings and improve our graph
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_for_graph, layout = "fr") + #the "fr" layout helps prevent nodes from overlapping
#plot the edges, fading with frequency
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
#plot the nodes
geom_node_point(color = "lightblue", size = 5) +
#add the words as labels
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
I used the bigrams with just the regular stopwords filtered out, since this graph can display a lot of information and we don't want to miss any of it. Most bigrams are linked to Harry, and we can also appreciate how "professor" has many other words tied to it, as does Gryffindor.
Actually the same could be done with trigrams, and perhaps this could shed some light on the content of the books as well:
We tokenize each book as a trigram and perform the same filtering and frequency counting as with the bigrams:
# Creating an empty tibble:
harry_potter_trigrams <- tibble()
# Looping through the book titles
for (i in seq_along(book_titles)) {
# Saving the cleaned text in the "clean" dataset
# Each chapter is represented as a different element inside the vector, so the way to access it is book[chapter]
clean <- tibble(chapter = seq_along(books[[i]]),
text = books[[i]]) |>
# Tokenize the text into trigrams
unnest_tokens(trigram, text, token = "ngrams", n = 3) |>
# Assign book titles
mutate(book = book_titles[i]) |>
# First column is the book
select(book, everything())
# Stack the books together in rows
harry_potter_trigrams <- rbind(harry_potter_trigrams, clean)
}
# Set levels for the books according to their order of publication
harry_potter_trigrams$book <- factor(harry_potter_trigrams$book, levels = rev(book_titles))
# Eliminate the unnecessary dataset
rm(clean)
# This is our final dataset
harry_potter_trigrams
## # A tibble: 1,088,986 × 3
## book chapter trigram
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 the boy who
## 2 Philosopher's Stone 1 boy who lived
## 3 Philosopher's Stone 1 who lived mr
## 4 Philosopher's Stone 1 lived mr and
## 5 Philosopher's Stone 1 mr and mrs
## 6 Philosopher's Stone 1 and mrs dursley
## 7 Philosopher's Stone 1 mrs dursley of
## 8 Philosopher's Stone 1 dursley of number
## 9 Philosopher's Stone 1 of number four
## 10 Philosopher's Stone 1 number four privet
## # ℹ 1,088,976 more rows
We filter them:
trigrams_separated <- harry_potter_trigrams %>%
#we separate each trigram into three columns: word1, word2 and word3
separate(trigram, c("word1", "word2", "word3"), sep = " ")
#we filter all words included in the word column in stop_words
trigrams_filtered <- trigrams_separated |>
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word)
# new trigram counts:
trigram_counts <- trigrams_filtered |>
#how many times in the trigrams_filtered they appear together
count(word1, word2, word3, sort = TRUE)
trigram_counts
## # A tibble: 42,686 × 4
## word1 word2 word3 n
## <chr> <chr> <chr> <int>
## 1 professor grubbly plank 42
## 2 quidditch world cup 39
## 3 mad eye moody 31
## 4 dark arts teacher 25
## 5 harry ron hermione 24
## 6 half moon spectacles 21
## 7 magical law enforcement 17
## 8 half blood prince 15
## 9 harry looked round 15
## 10 oak front doors 15
## # ℹ 42,676 more rows
And then unite them back:
trigrams_united <- trigrams_filtered |>
unite(trigram, word1, word2, word3, sep = " ")
trigrams_united
## # A tibble: 45,700 × 3
## book chapter trigram
## <fct> <int> <chr>
## 1 Philosopher's Stone 1 firm called grunnings
## 2 Philosopher's Stone 1 garden fences spying
## 3 Philosopher's Stone 1 son called dudley
## 4 Philosopher's Stone 1 dull gray tuesday
## 5 Philosopher's Stone 1 tawny owl flutter
## 6 Philosopher's Stone 1 owl flutter past
## 7 Philosopher's Stone 1 tabby cat standing
## 8 Philosopher's Stone 1 usual morning traffic
## 9 Philosopher's Stone 1 morning traffic jam
## 10 Philosopher's Stone 1 strangely dressed people
## # ℹ 45,690 more rows
Also, we can visualize them:
library(ggraph)
set.seed(2017)
# we use the dataframe with the trigram counts
trigram_counts
## # A tibble: 42,686 × 4
## word1 word2 word3 n
## <chr> <chr> <chr> <int>
## 1 professor grubbly plank 42
## 2 quidditch world cup 39
## 3 mad eye moody 31
## 4 dark arts teacher 25
## 5 harry ron hermione 24
## 6 half moon spectacles 21
## 7 magical law enforcement 17
## 8 half blood prince 15
## 9 harry looked round 15
## 10 oak front doors 15
## # ℹ 42,676 more rows
# keep only trigrams appearing more than 3 times
trigram_for_graph <- trigram_counts %>%
filter(n > 3) %>%
graph_from_data_frame()
# to introduce settings and improve our graph
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(trigram_for_graph, layout = "fr") + #layout is used to prevent nodes from overlapping
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
Again, we can see most trigrams are related to Harry, the professors, magical objects and institutions, and death.
Likewise, we can combine n-grams analysis with TF-IDF analysis to get a very informative output and instead of the most frequent bigrams we can get the most distinctive bigrams by book.
bigram_tf_idf <- bigrams_united %>%
#we count by book
count(book, bigram) %>%
#we perform tf_idf
bind_tf_idf(bigram, book, n) %>%
#we arrange in descending order
arrange(desc(tf_idf))
bigram_tf_idf
## # A tibble: 107,016 × 6
## book bigram n tf idf tf_idf
## <fct> <chr> <int> <dbl> <dbl> <dbl>
## 1 Order of the Phoenix professor umbridge 173 0.00533 1.25 0.00667
## 2 Prisoner of Azkaban professor lupin 107 0.00738 0.847 0.00625
## 3 Deathly Hallows elder wand 58 0.00243 1.95 0.00473
## 4 Goblet of Fire ludo bagman 49 0.00201 1.95 0.00391
## 5 Prisoner of Azkaban aunt marge 42 0.00290 1.25 0.00363
## 6 Deathly Hallows death eaters 139 0.00582 0.560 0.00326
## 7 Goblet of Fire madame maxime 89 0.00365 0.847 0.00309
## 8 Chamber of Secrets gilderoy lockhart 28 0.00232 1.25 0.00291
## 9 Half-Blood Prince advanced potion 27 0.00129 1.95 0.00252
## 10 Deathly Hallows deathly hallows 30 0.00126 1.95 0.00245
## # ℹ 107,006 more rows
And now we can plot it to better understand it:
bigram_tf_idf %>%
group_by(book) %>%
#choose maximum number of words
slice_max(tf_idf, n = 6) %>%
ungroup() %>%
ggplot(aes(tf_idf, fct_reorder(bigram, tf_idf), fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free") +
labs(x = "tf-idf", y = NULL)
This technique is better at showing characteristic events of each book, like Moaning Myrtle's scene with Harry in the bathroom or the Expecto Patronum charm that Harry conjures in Prisoner of Azkaban.
Comprehending the meaning and consequences of negation in language is important for a number of domains, such as sentiment analysis, natural language processing, and cognitive science. Negation, which is frequently indicated by words like “not,” “never,” or “no,” adds another level of complexity to language comprehension and interpretation since it sheds light on the sentiment and polarity of a text.
In the following lines its effects on the words of the books will be explored:
#get sentiments from afinn
AFINN <- get_sentiments("afinn")
Filtering the bigrams dataframe for those that contain the negation word “not” before another word:
nots <- harry_potter_bigrams |>
#separate the bigrams
separate(bigram, c("word1", "word2"), sep = " ") |>
#filter for bigrams with negation
filter(word1 == "not") %>%
inner_join(AFINN, by = c(word2 = "word")) |>
count(word2, value, sort = TRUE)
And now plot it :
nots |>
# compute the contribution of each word to the whole corpus: occurrences x sentiment value
mutate(contribution = n * value) |>
arrange(desc(abs(contribution))) |>
head(20) |>
ggplot(aes(reorder(word2, contribution), n * value, fill = n * value > 0)) +
geom_bar(stat = "identity", show.legend = FALSE) +
xlab("Words preceded by 'not'") +
ylab("Sentiment score * # of occurrences") +
coord_flip()
When it comes to interpretation, we have to take into account that these words are preceded by the word "not". For example "not help", which on its own would make a very positive contribution to the sentiment analysis of the books, in reality means the opposite. The same goes for "not bad", which in reality means that something is okay, yet in the analysis it only adds to the negative sentiment.
Bigrams such as "not help," "not want," and "not like" could thus be main causes of misidentification, leading to an excessively positive reading of the text (at least in comparison to what it should be).
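One simple, if crude, correction would be to flip the sign of the AFINN value whenever the preceding word is a negation. A sketch on invented bigrams (the word1/word2/value rows are illustrative, not taken from the AFINN lexicon):

```r
library(dplyr)

# Toy separated bigrams: word2 carries an AFINN-style value
toy <- tibble(
  word1 = c("not", "did", "not"),
  word2 = c("help", "like", "bad"),
  value = c(2, 2, -3)
)

negation_words <- c("not", "no", "never", "without")

# Flip the sentiment value when word1 is a negation word
toy |>
  mutate(adjusted = if_else(word1 %in% negation_words, -value, value))
```

With this rule, "not help" scores -2 instead of +2 and "not bad" scores +3 instead of -3, while "did like" is left untouched.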
Building on this, a more thorough analysis could include a long list of negation signal words, like “not,” “no,” “never,” and “without.” This wider focus would make it possible to identify a greater variety of words that come before negation and would make it easier to conduct a more thorough analysis of how these words affect the interpretation of sentiment.
negation_words <- c("not", "no", "never", "without")
(negated <- harry_potter_bigrams |>
separate(bigram, c("word1", "word2"), sep = " ") |>
filter(word1 %in% negation_words) |>
inner_join(AFINN, by = c(word2 = "word")) |>
count(word1, word2, value, sort = TRUE) |>
ungroup()
)
## # A tibble: 379 × 4
## word1 word2 value n
## <chr> <chr> <dbl> <int>
## 1 not want 1 81
## 2 no no -1 74
## 3 no doubt -1 53
## 4 not help 2 45
## 5 no good 3 38
## 6 not like 2 29
## 7 no chance 2 22
## 8 not care 2 22
## 9 no problem -2 21
## 10 no matter 1 19
## # ℹ 369 more rows
And now we plot it:
negated |>
mutate(contribution = n * value,
sign = if_else(value > 0, "positive", "negative")) |>
group_by(word1) |>
top_n(10, abs(contribution)) |>
ungroup() |>
ggplot(aes(y = reorder_within(word2, contribution, word1),
x = contribution,
fill = sign)) +
geom_col() +
scale_y_reordered() +
facet_wrap(~ word1, scales = "free") +
labs(y = 'Words preceded by a negation',
x = "Contribution (Sent value * number of mentions)",
title = "Most common positive or negative words following negations")
We can see in the previous plots how negation has affected the sentiment analysis of certain words.
We may want to condition the n-grams to obtain just those containing a specific word. In our case we could look at the most frequent connotations/words associated with each of the Harry Potter houses:
# Gryffindor
trigrams_filtered |>
#I will look inside trigrams to seek adjectives before and after the house name
filter(word2 == "gryffindor") |>
count(book, word1, word2, word3, sort = TRUE)
## # A tibble: 52 × 5
## book word1 word2 word3 n
## <fct> <chr> <chr> <chr> <int>
## 1 Philosopher's Stone award gryffindor house 3
## 2 Order of the Phoenix season gryffindor versus 2
## 3 Prisoner of Azkaban left gryffindor tower 2
## 4 Deathly Hallows cried gryffindor harry 1
## 5 Deathly Hallows fellow gryffindor muggle 1
## 6 Deathly Hallows godric gryffindor gryffindor's 1
## 7 Deathly Hallows godric gryffindor harry's 1
## 8 Deathly Hallows gold gryffindor lion 1
## 9 Deathly Hallows set gryffindor apart.harry 1
## 10 Half-Blood Prince giant gryffindor hourglass 1
## # ℹ 42 more rows
# Hufflepuff
trigrams_filtered |>
filter(word2 == "hufflepuff") |>
count(book, word1, word2, word3, sort = TRUE)
## # A tibble: 14 × 5
## book word1 word2 word3 n
## <fct> <chr> <chr> <chr> <int>
## 1 Deathly Hallows countless hufflepuff cups 1
## 2 Deathly Hallows sneering hufflepuff zacharias 1
## 3 Order of the Phoenix blond hufflepuff player 1
## 4 Goblet of Fire distinguish hufflepuff house 1
## 5 Goblet of Fire eleanor hufflepuff cauldwell 1
## 6 Goblet of Fire glory hufflepuff house 1
## 7 Goblet of Fire owen hufflepuff creevey 1
## 8 Chamber of Secrets cheerful hufflepuff ghost 1
## 9 Chamber of Secrets gryffindor hufflepuff ravenclaw 1
## 10 Chamber of Secrets haired hufflepuff boy 1
## 11 Chamber of Secrets helga hufflepuff rowena 1
## 12 Philosopher's Stone gryffindor hufflepuff ravenclaw 1
## 13 Philosopher's Stone pause hufflepuff shouted 1
## 14 Philosopher's Stone susan hufflepuff shouted 1
# Ravenclaw
trigrams_filtered |>
filter(word2 == "ravenclaw") |>
count(book, word1, word2, word3, sort = TRUE)
## # A tibble: 10 × 5
## book word1 word2 word3 n
## <fct> <chr> <chr> <chr> <int>
## 1 Deathly Hallows inside ravenclaw tower 2
## 2 Deathly Hallows deserted ravenclaw common 1
## 3 Deathly Hallows rowena ravenclaw lay 1
## 4 Deathly Hallows rowens ravenclaw wit 1
## 5 Half-Blood Prince gryffindor ravenclaw game 1
## 6 Order of the Phoenix immediately ravenclaw captain 1
## 7 Goblet of Fire stool ravenclaw shouted 1
## 8 Prisoner of Azkaban gryffindor ravenclaw hufflepuff 1
## 9 Prisoner of Azkaban percy's ravenclaw girlfriend 1
## 10 Prisoner of Azkaban tower ravenclaw played 1
# Slytherin
trigrams_filtered |>
filter(word2 == "slytherin") |>
count(book, word1, word2, word3, sort = TRUE)
## # A tibble: 25 × 5
## book word1 word2 word3 n
## <fct> <chr> <chr> <chr> <int>
## 1 Deathly Hallows cthen slytherin house 1
## 2 Deathly Hallows head slytherin cried 1
## 3 Half-Blood Prince salazar slytherin hankering 1
## 4 Half-Blood Prince single slytherin malfoy 1
## 5 Order of the Phoenix hoop slytherin score 1
## 6 Order of the Phoenix quaffle slytherin captain 1
## 7 Order of the Phoenix stringy slytherin boy 1
## 8 Order of the Phoenix versus slytherin drew 1
## 9 Goblet of Fire graham slytherin quirke 1
## 10 Goblet of Fire hungry slytherin loved 1
## # ℹ 15 more rows
Words associated with Gryffindor always include “house” (as with the other houses), as well as “tower” (referring to its location), Quidditch-related terms such as “season … versus” (referring to the games between houses), and also “fellow gryffindor muggle” (perhaps referring to Hermione, who is Muggle-born).
The Hufflepuff house is mostly related to proper nouns and to words about the inter-house games, such as “blond hufflepuff player,” “countless hufflepuff cups” or “glory hufflepuff house.”
More or less the same happens with the Ravenclaw house, as next to it we often see words like “game,” “captain,” “played” and “tower.”
Lastly, Slytherin can also be related to the games (with words like “captain,” “score,” and “hoop”), and one trigram stands out: “stringy slytherin boy,” which is probably referring to Draco.
We have been working with pairs of adjacent words that always go together. Let’s now look at pairs of words that appear in the same context, but not necessarily together.
Let’s take the book “Half-Blood Prince” as a sample for our analysis:
harry_potter_books <- harry_potter_books |>
mutate(wordcount = row_number())
#Let's take the Half-Blood Prince book for example
HP_section_words <- harry_potter_books |>
filter(book == "Half-Blood Prince") |>
mutate(section = wordcount %/% 100) |>
filter(section > 0) |>
filter(!word %in% stop_words$word)
HP_section_words
## # A tibble: 63,098 × 5
## book chapter word wordcount section
## <fct> <int> <chr> <int> <dbl>
## 1 Half-Blood Prince 1 nearing 272835 2728
## 2 Half-Blood Prince 1 midnight 272836 2728
## 3 Half-Blood Prince 1 prime 272837 2728
## 4 Half-Blood Prince 1 minister 272838 2728
## 5 Half-Blood Prince 1 sitting 272839 2728
## 6 Half-Blood Prince 1 office 272840 2728
## 7 Half-Blood Prince 1 reading 272841 2728
## 8 Half-Blood Prince 1 memo 272842 2728
## 9 Half-Blood Prince 1 slipping 272843 2728
## 10 Half-Blood Prince 1 brain 272844 2728
## # ℹ 63,088 more rows
The pairwise_count() function gives us one row for each pair of words and the number of times they co-appeared in the same 100-word section. It belongs to the widyr package.
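As a minimal illustration of what pairwise_count() does, here is a toy table (hypothetical words and sections, not taken from the books), with one row per word occurrence tagged with its section:

```r
library(widyr)
library(tibble)

toy <- tribble(
  ~word,   ~section,
  "harry", 1,
  "ron",   1,
  "wand",  1,
  "harry", 2,
  "ron",   2,
  "harry", 3
)

# "harry" and "ron" co-appear in sections 1 and 2, so that pair gets n = 2;
# every other pair co-appears only once
pairwise_count(toy, word, section, sort = TRUE)
```

Note that each pair appears twice in the output (once in each direction), as in the real results below.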
library(widyr)
word_pairs <- HP_section_words %>%
pairwise_count(word, section, sort = TRUE)
word_pairs
## # A tibble: 3,201,646 × 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 ron harry 274
## 2 harry ron 274
## 3 dumbledore harry 253
## 4 harry dumbledore 253
## 5 hermione harry 249
## 6 harry hermione 249
## 7 harry looked 240
## 8 looked harry 240
## 9 ron hermione 218
## 10 hermione ron 218
## # ℹ 3,201,636 more rows
As can be seen, it is mostly character names that co-appear in the book.
Now let’s filter to see which words often share context with each of the main characters, and then perform a sentiment analysis on them:
(this analysis only applies to the Half-Blood Prince book)
main_characters <- word_pairs %>%
filter(item1 == "harry" | item1 == "hermione" | item1 == "ron")
main_characters
## # A tibble: 23,343 × 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 ron harry 274
## 2 harry ron 274
## 3 harry dumbledore 253
## 4 hermione harry 249
## 5 harry hermione 249
## 6 harry looked 240
## 7 ron hermione 218
## 8 hermione ron 218
## 9 harry time 208
## 10 harry hand 147
## # ℹ 23,333 more rows
Add the sentiment contribution for each co-appearing word that goes together with the names:
main_characters_sentiment <- main_characters |>
inner_join(AFINN, by = c(item2 = "word")) |>
#use wt = n so count() keeps the co-appearance counts instead of counting rows
count(item1, item2, value, wt = n, sort = TRUE)
main_characters_sentiment <- main_characters_sentiment %>%
#create a column called contribution to store mentions in the corpus x value
mutate(contribution = n * value) |>
arrange(desc(abs(contribution))) |>
mutate(item2 = reorder(item2, contribution))
main_characters_plot <- main_characters_sentiment |>
group_by(item1) |>
summarise(total_contribution = sum(contribution))
And now we plot it:
ggplot(main_characters_plot, aes(x = item1, y = total_contribution, fill = item1)) +
geom_bar(stat = "identity") +
labs(title = "Total Contribution for Main Characters",
x = "Character", y = "Total Contribution",
fill = "Character") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = c("harry" = "dodgerblue3", "hermione" = "deeppink3", "ron" = "darkolivegreen4"))
The words that most frequently co-appear with each character’s name are quite negative, yet we should be expecting this by now given that the series in general is very negative. We can see that Harry has a startlingly low score, which in reality makes sense given that he is the main character and the one that almost all villains want to kill.
Correlation among words indicates how often they appear nearby relative to how often they appear separately.
When looking at a corpus, the phi coefficient measures how likely it is that two words appear together, taking into account the probability of each word appearing alone.
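For 0/1 presence indicators, the phi coefficient reduces to the ordinary Pearson correlation, which is what pairwise_cor() computes under the hood. A small base-R check, on toy counts (not taken from the books):

```r
# Phi coefficient from a 2x2 co-occurrence table:
# n11 = sections with both words, n10/n01 = only one of them, n00 = neither
phi_coef <- function(n11, n10, n01, n00) {
  (n11 * n00 - n10 * n01) /
    sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
}

x <- c(1, 1, 0, 0, 1, 0) # word 1 present in sections 1, 2, 5
y <- c(1, 0, 0, 0, 1, 0) # word 2 present in sections 1, 5
tab <- table(x, y)

phi_manual <- phi_coef(tab["1", "1"], tab["1", "0"], tab["0", "1"], tab["0", "0"])
all.equal(unname(phi_manual), cor(x, y)) # TRUE
```

This also explains why perfectly paired spell words like “expecto patronum” reach a correlation of exactly 1: they never appear in a section without each other.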
To perform this analysis the pairwise_cor() function is used instead:
#Now I want to perform an analysis on all the books so I remove the book filter
HP_section_words <- harry_potter_books |>
mutate(section = wordcount %/% 100) |>
filter(section > 0) |>
filter(!word %in% stop_words$word)
#And apply the pairwise_cor function to obtain correlations
word_cors <- HP_section_words %>%
group_by(word) %>%
filter(n() >= 20) %>%
pairwise_cor(word, section, sort = TRUE)
word_cors
## # A tibble: 11,212,452 × 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 patronum expecto 1
## 2 expecto patronum 1
## 3 grubbly plank 0.965
## 4 plank grubbly 0.965
## 5 kedavra avada 0.923
## 6 avada kedavra 0.923
## 7 felicis felix 0.913
## 8 felix felicis 0.913
## 9 maxime madame 0.899
## 10 madame maxime 0.899
## # ℹ 11,212,442 more rows
Here only regular stopwords have been filtered.
Let’s look at the words most correlated with four terms that are quite common across the books. Now I will plot it:
word_cors |>
#we define a vector for 4 words
filter(item1 %in% c("wand", "voldemort", "dumbledore", "death")) |>
#we group by item1
group_by(item1) |>
#we keep the 7 most correlated
slice_max(correlation, n = 7) |>
ungroup() |>
#we reorder item2 regarding its correlation
mutate(item2 = reorder(item2, correlation)) |>
#we plot
ggplot(aes(item2, correlation, fill=item1)) +
geom_bar(stat = "identity") +
facet_wrap(~ item1, scales = "free") +
coord_flip()
This can be interpreted as follows:
Death is correlated with the Death Eaters, indicating a strong association with dark forces. It is also correlated with Voldemort, a curse and a prophecy, suggesting involvement in significant plot elements.
Dumbledore is correlated with his own first name (Albus), but also with “headmaster” and “voldemort,” reflecting his role as Hogwarts’ headmaster and his conflicts with Voldemort.
Voldemort is correlated with “voldemort’s” and “lord,” highlighting possessives and his dark title. He is also correlated with “death” and “dumbledore,” reflecting his association with death and his opposition to Dumbledore.
Wand is correlated with actions like “raised” and “flew,” suggesting activities related to wand usage, and also with features like “tip” and “elder,” reflecting characteristics of wands.
Also, we can visualise the correlations in a network plot:
library(igraph) # for graph_from_data_frame()
library(ggraph) # for the network layout and geoms
set.seed(2016)
word_cors %>%
filter(correlation > 0.5) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
This is an alternative way of visualising the previously explained plot.
First I want to filter out some expressions that act, to me, like stopwords specific to the way each of the main characters speaks:
characters_stop_words <- bind_rows(tibble(word = c("ron's", "harry's", "d'you", "bit", "yeah", "hermione's", "dunno", "reckon", "thinking", "ing"),
lexicon = c("characters")),
stop_words)
Once this is done we can plot the words for the main characters:
word_cors |>
# Filter out stopwords
filter(!item1 %in% characters_stop_words$word, !item2 %in% characters_stop_words$word) |>
#we define a vector for 4 words
filter(item1 %in% c("harry", "hermione", "ron")) |>
#we group by item1
group_by(item1) |>
#we keep the 15 most correlated
slice_max(correlation, n = 15) |>
ungroup() |>
#we reorder item2 regarding its correlation
mutate(item2 = reorder(item2, correlation)) |>
#we plot
ggplot(aes(item2, correlation, fill=item1)) +
geom_bar(stat = "identity") +
facet_wrap(~ item1, scales = "free") +
coord_flip() +
labs(fill = "Character") # Change legend title
Here the interpretation gets more interesting:
The top word for each character reveals the “love triangle” that characterises the saga. Although it is not really a love triangle, Harry’s best friend has been Ron since the beginning, but then Hermione and Ron develop feelings for each other. This is crystallised in Harry having Ron as his top word, while Ron and Hermione have each other instead.
Harry’s main words are related to other characters (such as Slughorn, Dobby, Malfoy or Ginny, his love interest) and to dramatic words tied to the main story such as “scar,” “yelled,” “feeling” or “uncle.”
Ron’s main words are related to his siblings (Fred, Ginny and George), to Scabbers (his rat) and to Gryffindor and the activities he carried out there (homework, breakfast…).
Hermione’s words are related to school matters (homework, lesson, library, class, dean…), which reveals that she was a good student.
Actually, this kind of search can be done more conveniently with the kwic() function from the quanteda package, which puts certain keywords in context, and we can feed the books directly to it. The houses are discussed especially in the first, second and fourth books, so let’s put these books together and analyse the context in which each house appears:
houses_texts <- c(
philosophers_stone,
chamber_of_secrets,
goblet_of_fire
)
# Combine all the texts into a single string
houses_texts <- paste(houses_texts, collapse = " ")
Context analysis of the houses:
First we prepare the window of 10 words for each house and then bind them into a single tibble:
library(quanteda) # provides kwic()
gryffindor <- kwic(houses_texts, "gryffindor", valuetype = "regex", window = 10)
hufflepuff <- kwic(houses_texts, "hufflepuff", valuetype = "regex", window = 10)
ravenclaw <- kwic(houses_texts, "ravenclaw", valuetype = "regex", window = 10)
slytherin <- kwic(houses_texts, "slytherin", valuetype = "regex", window = 10)
Combine into a single tibble:
library(dplyr)
# Combine the kwic results into a single tibble
houses_tibble <- bind_rows(
gryffindor %>% as_tibble(),
hufflepuff %>% as_tibble(),
ravenclaw %>% as_tibble(),
slytherin %>% as_tibble()
)
houses_tibble <- houses_tibble |>
#unite the text before and after the word with the word
unite(text, pre, keyword, post, sep = " ", remove = FALSE) |>
#remove irrelevant variables
select(-docname, -from, -to, -pre, -keyword, -post)
Sentiment analysis of the house contexts:
# Load required libraries
library(dplyr)
library(tidyr)
library(tidytext)
library(ggplot2)
# Step 1: Perform sentiment analysis
sentiment_analysis <- houses_tibble %>%
unnest_tokens(word, text) %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
group_by(pattern) %>%
summarise(sentiment_score = mean(value)) %>%
ungroup()
# Step 2: Join the sentiment analysis results with the original tibble
houses_tibble <- left_join(houses_tibble, sentiment_analysis, by = "pattern")
# Step 3: Plot the sentiment scores for each house
ggplot(sentiment_analysis, aes(x = pattern, y = sentiment_score, fill = pattern)) +
geom_bar(stat = "identity") +
labs(title = "Sentiment Analysis by Harry Potter House",
x = "House",
y = "Average Sentiment Score") +
theme_minimal() +
theme(legend.position = "none")
The sentiment analysis by Harry Potter house reveals varying degrees of positivity associated with each house.
Hufflepuff emerges with the highest sentiment score of 0.687, indicating a predominantly positive sentiment, likely reflecting qualities such as loyalty and inclusivity.
Gryffindor follows with a score of 0.506, suggesting a moderately positive sentiment attributed to bravery and heroism.
Slytherin and Ravenclaw both exhibit moderately positive sentiments, with scores of 0.283 and 0.282, respectively. These scores may reflect the ambitious and cunning nature of Slytherin, as well as the intelligence and wit of Ravenclaw students.
To sum up, while each house demonstrates positive sentiment, Hufflepuff stands out as the most positively perceived house in this analysis. This could, however, be subject to the characters associated with each house, who play a significant role in shaping perceptions, and it is precisely the three main characters who encounter the most problems carrying negative connotations or sentiment.
To do some topic modelling and better understand the content of the saga, it is necessary to convert our Harry Potter dataframe into a DTM (document-term matrix), with the seven books as documents, and then fit an LDA model with the LDA() function from the topicmodels package.
Prepare the Harry Potter Books dataframe:
#design the stopwords to be filtered for topic modelling
topic_stop_words <- bind_rows(tibble(word = c("top", "well", "led", "harry", "ron", "hermione", "weasley", "professor", "potter", "harry's", "madam", "madame", "looked", "dumbledore", "yeah"),
lexicon = c("topic_modelling")),
stop_words)
library(tm)
#select the dataset that has the necessary variables for creating the DTM
harry_potter_dtm <- book_words |>
select(book, word, n)|> #data1
anti_join(topic_stop_words, join_by(word))#data2, which is the just designed custom stop words that also filters main characters names
# Define the order of the books
book_order <- c(
"Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
"Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince", "Deathly Hallows"
)
# Convert book names to numbers based on the specified order
book_to_number <- function(book_name) {
match(book_name, book_order)
}
# Rename columns and convert book names to numbers
harry_potter_dtm_formatted <- harry_potter_dtm %>%
mutate(
document = book_to_number(book),
term = word,
count = n
) %>%
select(document, term, count)
harry_potter_dtm <- harry_potter_dtm_formatted %>%
#we use the cast function with the three columns needed
cast_dtm(document, term, count)
harry_potter_dtm
## <<DocumentTermMatrix (documents: 7, terms: 23781)>>
## Non-/sparse entries: 63556/102911
## Sparsity : 62%
## Maximal term length: 24
## Weighting : term frequency (tf)
The sparsity is quite low (62%), meaning more or less the same vocabulary is used across the whole saga.
Now we fit the LDA model. I will choose 2 topics despite there being 7 books, since it is one saga and the books are all tightly related:
library(topicmodels)
harry_potter_lda <- LDA(harry_potter_dtm, k = 2, control = list(seed = 1234))
harry_potter_lda
## A LDA_VEM topic model with 2 topics.
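The choice of k can be sanity-checked by comparing model perplexity for a few candidate values (lower is better, with diminishing returns as k grows). This is only a sketch: it reuses the harry_potter_dtm built above, and refitting the model several times may take a while:

```r
library(topicmodels)

# Fit one model per candidate k and record its perplexity on the training DTM
ks <- c(2, 3, 5, 7)
perplexities <- sapply(ks, function(k) {
  model <- LDA(harry_potter_dtm, k = k, control = list(seed = 1234))
  perplexity(model)
})
data.frame(k = ks, perplexity = perplexities)
```

Note that training-set perplexity tends to keep decreasing with k, so held-out perplexity (via the newdata argument of perplexity()) would be a stricter check.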
And now we tidy it back:
harry_potter_topics <- tidy(harry_potter_lda)
Let’s find the 15 most common words for each topic and plot them.
top_terms <- harry_potter_topics |>
group_by(topic) |>
slice_max(beta, n = 15) |>
ungroup() |>
arrange(topic, -beta)
top_terms
## # A tibble: 30 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 head 0.00700
## 2 1 hagrid 0.00651
## 3 1 hogwarts 0.00409
## 4 1 hand 0.00401
## 5 1 wand 0.00398
## 6 1 voice 0.00360
## 7 1 death 0.00349
## 8 1 fred 0.00338
## 9 1 door 0.00316
## 10 1 heard 0.00308
## # ℹ 20 more rows
top_terms |>
mutate(term = reorder_within(term, beta, topic)) |>
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
It seems like there is really just a single topic: many terms appear in both modeled topics, and the terms in the two topics are closely related, so there is nothing distinctive about either of them.
This could be due to the low sparsity of the saga, but it gives us a reason to say that it is an easy-to-follow saga with many recurring elements across all of the books.
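One way to check this impression is to look at the per-book topic probabilities (the gamma matrix): if every book is a heavy mixture of both topics, the model has found no distinct themes. This sketch reuses the harry_potter_lda model fitted above:

```r
library(tidytext)
library(tidyr)

# gamma = estimated proportion of each document (book) generated by each topic
harry_potter_gamma <- tidy(harry_potter_lda, matrix = "gamma")

harry_potter_gamma |>
  # one row per book, one column per topic
  pivot_wider(names_from = topic, values_from = gamma)
```

Gamma values far from 0 and 1 across the board would confirm that the two topics do not cleanly separate the books.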
Finally, and although the analysis has been interpreted “on the go”, it can indeed be concluded that the sentiment of the saga becomes more negative as the story unfolds.